Abstract

The Candida Gene Order Browser (CGOB) was developed as a tool to visualize and analyze synteny relationships in
multiple Candida species, and to provide an accurate, manually curated set of orthologous Candida genes for
evolutionary analyses. Here, we describe major improvements to CGOB. The underlying structure of the database has been
changed significantly. Genomic features are now based directly on genome annotations rather than on protein sequences,
which allows non-protein features such as centromere locations in Candida albicans and tRNA genes in all species to be
included. The data set has been expanded to 13 species, including genomes of pathogens (C. albicans, C. parapsilosis,
C. tropicalis, and C. orthopsilosis), and those of xylose-degrading species with important biotechnological applications
(C. tenuis, Scheffersomyces stipitis, and Spathaspora passalidarum). Updated annotations of C. parapsilosis, C. dubliniensis,
and Debaryomyces hansenii have been incorporated. We discovered more than 1,500 previously unannotated genes
among the 13 genomes, encoding proteins ranging in size from 29 to 3,850 amino acids. Poorly conserved and rapidly
evolving genes were also identified. Re-analysis of the mating type loci of the xylose degraders suggests that C. tenuis
is heterothallic, whereas both Spa. passalidarum and S. stipitis are homothallic. As well as hosting the browser, the
CGOB website (http://cgob.ucd.ie) gives direct access to all the underlying genome annotations, sequences, and curated
orthology data.
Introduction

The Candida gene order browser (CGOB) was originally 
adapted from the yeast gene order browser (YGOB), a tool 
that facilitates visual comparisons and computational analysis 
of synteny relationships in yeasts from the Saccharomyces 
clade (Byrne and Wolfe 2005, 2006). The first version of 
CGOB (Fitzpatrick et al. 2010) contained 10 genomes from 
9 Candida species. Like YGOB, CGOB consists of a database, a 
browser, and a software engine for whole-genome evolution- 
ary analyses. The database consists of orthologous gene as- 
signments (pillars) that have been extensively manually 
curated, based on genomic context (local synteny) as well 
as sequence similarity, providing a "gold-standard" set of 
orthologs for evolutionary analysis. The browser is an inter- 
active tool for visualizing gene order relationships in any sec- 
tion of the genome. It displays a matrix (fig. 1) where each 
column shows a set of orthologous genes (a pillar) and each 
continuous horizontal element (a track) represents a segment 
of chromosome. The software engine allows the whole data- 
base to be searched for particular synteny-related patterns, 
such as sites where tRNA genes coincide with interspecies 
rearrangements (Gordon et al. 2009), without users having 
to manually browse through the whole genome. 

CGOB was designed to facilitate comparative analysis
within the "CTG" clade of yeast species that translate the
codon CTG as serine instead of the canonical leucine
(Santos et al. 1993; Massey et al. 2003; Fitzpatrick et al.
2006). This clade includes important fungal pathogens such
as Candida albicans, C. dubliniensis, C. tropicalis, and
C. parapsilosis, which are diploid and either asexual or parasexual
(Hull et al. 2000; Magee BB and Magee PT 2000; Bennett
and Johnson 2003; Butler 2007, 2010; Butler et al. 2009).
Related haploid and sexual species are also included, such
as Clavispora (previously Candida) lusitaniae, Meyerozyma
(previously Pichia) guilliermondii, Scheffersomyces (previously
Pichia) stipitis, and Debaryomyces hansenii (also known as
C. famata) (Fabre et al. 2005; Jeffries et al. 2007; Reedy et al.
2009). The diploid species Lodderomyces elongisporus is more
closely related to C. albicans than to the haploid species,
although there are some reports that it may have a sexual cycle
(van der Walt 1966; Lockhart et al. 2008). CGOB was previously
used to identify clusters of genes associated with metabolic
pathways in Candida species (Fitzpatrick et al. 2010).
Recently, we used CGOB to help annotate the genome of
C. orthopsilosis, a species closely related to C. parapsilosis
(Riccombeni et al. 2012), which has now been added to the
browser database. We have also included the genomes of the
xylose-fermenting yeasts C. tenuis and Spathaspora passalidarum
(Wohlbach et al. 2011).
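
The CTG reassignment that defines the clade can be illustrated with a short sketch. It builds the standard genetic code and a variant in which CTG encodes serine, then translates the same coding sequence with both. The function name `translate` and the example sequence are illustrative assumptions, not part of CGOB.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, amino acids listed in TCAG codon order.
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODONS = ["".join(c) for c in product(BASES, repeat=3)]
STANDARD = dict(zip(CODONS, STANDARD_AAS))
# CTG-clade variant: identical to the standard code except CTG -> serine.
CTG_CLADE = dict(STANDARD, CTG="S")


def translate(cds, table):
    """Translate a coding sequence codon by codon; unknown codons become 'X'."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    return "".join(table.get(cds[i:i + 3], "X") for i in range(0, usable, 3))


if __name__ == "__main__":
    example = "ATGCTGTAA"  # hypothetical three-codon ORF
    print(translate(example, STANDARD))   # leucine at the CTG codon
    print(translate(example, CTG_CLADE))  # serine at the CTG codon
```

Any tool that derives protein sequences from nucleotide coordinates for these species has to apply the reassigned table, otherwise every CTG codon yields the wrong residue.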

As well as increasing the number of species included, we 
also describe significant and fundamental changes that have 
Fig. 1. CGOB display and bioinformatics tools. (A) CGOB screenshot. Each track represents a chromosomal region from one species. Tracks are labeled
on the right. Each box represents a feature (with gene name and chromosome) and each color a chromosome (with different color palettes used for each
genome). A change in track color indicates a break in synteny. Arrows under boxes denote relative orientation. White boxes represent tRNA features
(amino acid and anticodon are displayed) and black boxes centromeres. Solid wide black connectors link adjacent genes and are continued in gray if
there is a gap in that genome. Clicking these will output the intergenic sequence between the two features. When these connectors are colored orange it
denotes an inversion (visible between the first two pillars in Candida parapsilosis and C. orthopsilosis). Double and single small bars connect nonadjacent
genes <5 and <20 genes apart, respectively (not shown). The control console at the bottom of the interface lets users input a gene name, select the
window size and the version to display, turn genomes on and off, and turn RNA features on and off. The display is centered on SEC34 from C. albicans,
highlighted with a yellow box on the top track. (B) Information from CGD for the C. albicans SEC34 gene (launched by the "i" button on gene box in
[A]). The equivalent button on the C. parapsilosis track connects to the same database, and on the Saccharomyces cerevisiae track launches the Yeast
been made to the structure and function of CGOB. All
sequence features are now based on genomic (nucleotide)
coordinates, and include tRNA genes as well as protein-coding
sequences. We have upgraded the browser interface to add
significant new functionality. More importantly, we have
substantially improved the annotation of the Candida genomes,
by systematic use of homology and synteny searches to
identify 1,525 previously unrecognized protein-coding genes. We
also removed more than 1,000 predicted introns from
S. stipitis gene models. CGOB is a powerful tool for the analysis
of gene and genome evolution in the Candida clade, and is
the only tool that facilitates comparative genomic analysis of
human fungal pathogens and species important
for biofuel production from wood waste.

Results and Discussion 

New Genomes and Interface 

The original CGOB and YGOB databases did not contain any 
DNA sequence information or gene co-ordinates. They con- 
sisted of a static file of protein sequences from each species, a 
static list of the order of the corresponding genes along chro- 
mosomes in each species, and an editable (dynamic) orthol- 
ogy database containing information about which proteins 
were in each pillar. This organizational structure made it dif- 
ficult for curators to modify the data, even in cases where 
there were obvious errors in the annotation of a genome. We 
therefore switched both CGOB and YGOB to a framework 
where features (genes or other elements) are defined by their 
genomic coordinates. Each genome sequence is now loaded 
into the browsers at the DNA level, and all sequence infor- 
mation for a feature is generated dynamically from the 
corresponding chromosome sequence. Gene order is also cal- 
culated dynamically. Genome annotations can now be mod- 
ified easily — features annotated by the original authors of a 
genome sequence can be switched OFF by curators if neces- 
sary (or turned back ON), new features can be added, and the 
coordinates of any feature can be modified (fig. 1). 
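
The coordinate-based framework described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the CGOB implementation: the `Feature` class, its field names, and the 1-based inclusive coordinate convention are hypothetical choices made for the example.

```python
from dataclasses import dataclass

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


@dataclass
class Feature:
    name: str
    chrom: str
    start: int            # 1-based, inclusive (assumed convention)
    end: int              # 1-based, inclusive
    strand: str           # "+" or "-"
    active: bool = True   # a curator can switch a feature OFF without deleting it


def feature_sequence(feature, genome):
    """Slice a feature's nucleotides from its chromosome on demand."""
    seq = genome[feature.chrom][feature.start - 1:feature.end]
    if feature.strand == "-":
        seq = seq.translate(_COMPLEMENT)[::-1]  # reverse complement
    return seq


def gene_order(features, chrom):
    """Names of active features on one chromosome, sorted by start coordinate."""
    on_chrom = [f for f in features if f.chrom == chrom and f.active]
    return [f.name for f in sorted(on_chrom, key=lambda f: f.start)]
```

Because nothing but coordinates is stored, changing a feature's `start`, `end`, or `active` flag is enough to change both its sequence and the computed gene order.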

The original version of CGOB contained 10 genomes from
species from the Candida (CTG) clade (including two isolates
of C. albicans), plus Saccharomyces cerevisiae as a reference.
We have now included updated annotations of C. albicans
(Bruno et al. 2010; Tuch et al. 2010), C. dubliniensis (Jackson
et al. 2009), and D. hansenii (DEHA2 gene models) and made
significant changes to the S. stipitis annotation (discussed
later). We have also incorporated an annotation of
C. parapsilosis based on RNA-seq analysis (Guida et al. 2011) and the
recently sequenced genome of C. orthopsilosis (Riccombeni
et al. 2012). Finally, we include the genomes of C. tenuis and
Spa. passalidarum (Wohlbach et al. 2011). Together with
S. stipitis, C. tenuis and Spa. passalidarum are among the
few species that can ferment and assimilate the pentose
sugar xylose, a major component of plant cell walls
(Wohlbach et al. 2011). Xylose fermentation is required for
the efficient use of plant material for biofuel production, and
CGOB is the only tool that facilitates comparisons between
pathogenic species and those that have important
applications for biotechnology.

To improve the visualization of larger numbers of ge- 
nomes, we implemented the more streamlined browser in- 
terface shown in figure 1. To save vertical screen space and 
allow more genomes to be displayed, we moved genome 
names to the right edge of the screen and compressed the 
vertical space required for each genome by over a quarter. 
This flatter display creates space for extra genomes. We flat- 
tened the control panel at the bottom of the screen by 
making a drop-down list of species names that is used to 
select which genomes are displayed. This change to the con- 
trol panel also allows us to define a subset of species that will 
be used as the default group for display; chromosomal tracks 
from the other species will only be shown if a user chooses to 
activate them. We use this approach in YGOB, where the 
default display shows 26 tracks and another 7 are hidden 
by default, but not in CGOB where all 14 tracks from the 
current database can fit on most computer screens. 

Many of the bioinformatics tools in the original CGOB
browser interface have been updated (fig. 1). Information
("i") buttons on the C. albicans and C. parapsilosis tracks launch the
Candida Genome Database (CGD) (Costanzo et al. 2006),
and those on the Sac. cerevisiae track connect to the Saccharomyces
Genome Database (SGD) (Cherry et al. 2012) (fig. 1B). A
BLASTP search versus a database of all proteins in CGOB
can be launched by clicking the "b" button on any gene's
icon; the query amino acid sequence is now generated
dynamically (fig. 1C). Users can now also rerun the BLASTP search
with the SEG filter (which removes areas of low compositional
complexity) off. Checkboxes in the BLASTP search results
page allow users to select multiple genes from the results
list, and then to retrieve their sequences (FASTA amino
acid or nucleotide sequences), generate a multiple sequence
alignment, draw a phylogenetic tree, or calculate levels of
synonymous and nonsynonymous sequence divergence
(using the same tools that can be launched from the CGOB
interface, described later). This feature enables a user to
compare or test genes that appear in BLAST results without
having to manually extract their sequences, for example, to
make a phylogenetic tree of a gene family.



Fig. 1. Continued 

Genome Database (SGD). (C) BLASTP results for Sec34 versus all CGOB proteins (launched by "b" button on gene box). Pink shading indicates hits
to genes that are in the same pillar as the gene used as a BLAST query. (D) Amino acid sequences for genes in the pillar (launched by "aa" button above
the tracks). (E) Nucleotide sequences for genes in the pillar (launched by "nt" button above the tracks). (F) Intergenic sequence between
cdub_CGOB_00003 and the adjacent centromere (launched by clicking on connector bar between them). (G) MUSCLE multiple sequence
alignment of the proteins in the Sec34 pillar (launched from the "msa" button below the tracks). (H) Pairwise yn00 output for all genes
in the SEC34 pillar (launched from "rates" below the tracks). (I) PhyML tree of genes in the SEC34 pillar (launched from the "tree" button below
the tracks).
Links at the top of each pillar allow retrieval of the amino
acid and nucleotide sequences of all genes in that pillar,
extracted dynamically (fig. 1D and E). Users can now also output
the intergenic DNA sequence between features by clicking on
the black or gray connectors between them (fig. 1F). Links at
the bottom of each pillar allow users to generate a multiple
sequence alignment using MUSCLE (Edgar 2004) (fig. 1G), to
calculate evolutionary sequence divergence between all pairs
of sequences in the pillar using yn00 (Yang and Nielsen 2000)
(fig. 1H), or to construct a PhyML phylogenetic tree (Guindon
et al. 2009) (fig. 1I). The button in the bottom left-hand
corner of every CGOB page (fig. 1A) will output the same
information seen on screen in a tab-delimited text format that
is easier to save and manipulate.
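
Two of the outputs described above, the intergenic sequence between adjacent features and the tab-delimited export of the on-screen tracks, can be sketched as follows. The function names, the 1-based inclusive coordinates, and the use of "-" for a gap are assumptions made for illustration; they are not taken from the CGOB code.

```python
import csv
import io


def intergenic_sequence(chrom_seq, left_end, right_start):
    """Sequence strictly between two features on the same chromosome.

    left_end is the last base of the upstream feature and right_start the
    first base of the downstream feature (1-based, inclusive).
    """
    if right_start <= left_end + 1:
        return ""  # features abut or overlap, so there is nothing between them
    return chrom_seq[left_end:right_start - 1]


def tracks_to_tsv(tracks):
    """Write one row per track: species name, then one gene name per pillar."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    for species, genes in tracks.items():
        writer.writerow([species] + [g if g is not None else "-" for g in genes])
    return buffer.getvalue()
```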

CGOB recognizes both systematic names and synonyms
for C. albicans genes (taken from CGD; Costanzo et al. 2006).
For C. albicans WO-1, C. tropicalis, L. elongisporus,
M. guilliermondii, and Cla. lusitaniae the systematic names generated in
the sequencing project (Butler et al. 2009) and used in CGD
(Inglis et al. 2012) are recognized. The most up-to-date
annotations for C. dubliniensis (Jackson et al. 2009), D. hansenii
(DEHA2), and C. orthopsilosis (Riccombeni et al. 2012) are also
included. For C. parapsilosis, CGOB recognizes gene identifiers
from the most recent annotation (cpar2; Guida et al. 2011), as
well as from earlier annotations (CPAG [Jackson et al. 2009]
and cpar [Rossignol et al. 2009]).

New Noncoding Features 

The original CGOB browser contained only protein-coding
genes. We have now annotated transfer RNA genes across all
the genomes, using tRNAscan-SE (Lowe and Eddy 1997).
tRNA features are displayed on screen as white boxes (e.g.,
the leucine tRNA pillar in fig. 1A). This makes it possible to identify
associations between tRNA genes and genomic breakpoints, as
was hypothesized to have occurred during the acquisition of a
proline racemase gene in the C. parapsilosis lineage by
horizontal gene transfer (Fitzpatrick et al. 2008). tRNA gene
locations are generally well conserved across species in the CTG
clade. For example, we were able to identify probable
orthologs in D. hansenii of 48% of the 126 tRNA genes in C. albicans,
based on conserved synteny with the nearby protein-coding
genes.
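
One way to express the synteny criterion for tRNA genes is sketched below. The rule used here (same anticodon and at least one shared flanking protein-coding pillar) is an assumption chosen for illustration; the text states only that orthology was inferred from conserved synteny with nearby protein-coding genes. All identifiers in the usage example are hypothetical.

```python
def trna_orthologs(query_trnas, target_trnas):
    """Pair tRNA genes between two species using anticodon plus flanking pillars.

    Each input maps a tRNA name to (anticodon, flanking_pillar_ids), where the
    pillar ids are those of the nearest protein-coding genes on either side.
    """
    pairs = {}
    for q_name, (q_anticodon, q_flanks) in query_trnas.items():
        for t_name, (t_anticodon, t_flanks) in target_trnas.items():
            same_anticodon = q_anticodon == t_anticodon
            shared_flank = bool(set(q_flanks) & set(t_flanks))
            if same_anticodon and shared_flank:
                pairs[q_name] = t_name
                break  # keep the first syntenic match for this query tRNA
    return pairs


if __name__ == "__main__":
    calb = {"tL_1": ("CAA", ("P10", "P11")), "tS_1": ("AGA", ("P20", "P21"))}
    dhan = {"tL_7": ("CAA", ("P11", "P12")), "tS_9": ("AGA", ("P90", "P91"))}
    matched = trna_orthologs(calb, dhan)
    print(matched, len(matched) / len(calb))
```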

We also added annotations of ribosomal RNA (rRNA) genes to
the CGOB data set, based either on annotations by the
original authors or on BLASTN searches with the C. albicans 18S,
5.8S, 25S, and 5S genes. The location of the rDNA array is
conserved among C. albicans, C. dubliniensis, and C. tropicalis,
and (at a different site) among C. parapsilosis, C. orthopsilosis,
and L. elongisporus (Proux 2012). In other CTG clade species,
rDNA arrays are present at species-specific sites
(M. guilliermondii, S. stipitis, Spa. passalidarum, and C. tenuis) or near
telomeres (Cla. lusitaniae). We cannot find the locus in
D. hansenii. It is clear that the rDNA array has moved
around the genome during CTG clade evolution, but unlike
the situation in the Saccharomyces clade (Proux-Wera et al.
2013) we were unable to identify an ancestral rDNA location
in the CTG clade or retrace the history of rDNA movement
away from this site.

Unlike species in the Saccharomyces clade, which have
"point" centromeres (conserved sequences recognized by
specific proteins) that are relatively easily identified
by sequence analysis, the Candida species have "regional"
centromeres, which are longer and more poorly conserved
(Ishii 2009). Centromeres have been experimentally verified
only in C. albicans and C. dubliniensis (Sanyal et al. 2004;
Padmanabhan et al. 2008) and these are now included in
CGOB. Centromere locations have been predicted in Cla.
lusitaniae and S. stipitis (Lynch et al. 2010) but as these
have not yet been experimentally verified they have not
been included in the browser. As the noncoding and RNA
features are not relevant to every researcher or every question,
they can all be turned off (and back on again) in the control
panel.

Gene Discovery 

For many of the Candida genomes (apart from C. albicans),
the initial gene annotation was performed automatically,
with some support from synteny analysis (Butler et al.
2009). We therefore suspected that, as in the Saccharomyces
species (OhEigeartaigh et al. 2011), many protein-coding
genes, particularly those with short open reading frames,
were likely to have been missed. Many may be important
for function. In addition, analysis of gene loss and gain
during evolution requires accurate information. We therefore
used the SearchDOGS program originally developed for
YGOB (OhEigeartaigh et al. 2011) to search for missing
orthologs. This program uses BLAST searches combined with
synteny information to detect unannotated genes, for example,
by re-examining the DNA sequence of a region that is
annotated as intergenic in one species, but which is in the same
genomic context (the same flanking genes on each side) as an
annotated gene in one or more other species (fig. 2). We ran
two iterations of SearchDOGS, followed by manual curation
of all predicted open reading frames. This analysis identified
1,525 new genes from all 13 genomes (table 1 and
supplementary table S1, Supplementary Material online). The new
features have been entered into the CGOB database and
identified with the prefix "CGOB," a short species identifier
and a number (e.g., lelo_CGOB_00001 from L. elongisporus, or
wo-1_CGOB_00025 from C. albicans WO-1).
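
The synteny step of this kind of search can be sketched as follows. This is a simplified illustration, not the SearchDOGS algorithm itself: it only locates intergenic intervals that sit between the orthologs of the two pillars flanking a pillar in which the species has no gene, and it omits the BLAST step entirely. The data layout, function names, and the five-digit zero padding (inferred from the two example identifiers above) are assumptions.

```python
def candidate_regions(pillars, species, coords):
    """Find intergenic intervals where an unannotated ortholog might lie.

    pillars: list of dicts (species -> gene name or None), in genome order.
    coords:  gene name -> (chromosome, start, end), 1-based inclusive.
    Returns (pillar_index, chromosome, gap_start, gap_end) tuples.
    """
    hits = []
    for i in range(1, len(pillars) - 1):
        if pillars[i].get(species) is not None:
            continue  # already annotated in this species
        if not any(pillars[i].values()):
            continue  # no annotated ortholog in any species to search with
        left = pillars[i - 1].get(species)
        right = pillars[i + 1].get(species)
        if left is None or right is None:
            continue  # flanking genes are needed to define the context
        l_chrom, l_start, l_end = coords[left]
        r_chrom, r_start, r_end = coords[right]
        if l_chrom != r_chrom:
            continue  # synteny is broken here
        first, second = sorted([(l_start, l_end), (r_start, r_end)])
        if second[0] > first[1] + 1:
            hits.append((i, l_chrom, first[1] + 1, second[0] - 1))
    return hits


def cgob_id(species_code, number):
    """Format an identifier such as lelo_CGOB_00001."""
    return f"{species_code}_CGOB_{number:05d}"
```

Each interval returned would then be searched at the DNA level for similarity to the annotated orthologs, and any resulting open reading frame would still need manual curation.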

The 1,525 newly annotated genes (supplementary table S1,
Supplementary Material online) encode proteins ranging in
size from 29 amino acids (ctro_CGOB_00073) to 3,850 amino
acids (ctro_CGOB_00180). Although most are short (43% of
them are <100 amino acids), some are surprisingly long (e.g.,
53 ORFs of >1,000 amino acids were identified). We have
identified new genes even in manually curated genomes like
C. albicans. Thirteen new protein-coding genes were
predicted in C. albicans SC5314 and many more (114) in
C. albicans WO-1. These include Lsm5 (fig. 2), a component
of two heteroheptameric complexes in Sac. cerevisiae that are
involved in mRNA degradation and splicing (He and Parker
2000). All the other components of the Lsm complexes were
[Fig. 2 panels: CGOB screenshots and multiple sequence alignments for the Lsm5, Tom5, and Pmp1 orthologs.]

Fig. 2. Gene finding in Candida species. Examples of three small genes (orthologs of LSM5, TOM5, and PMP1 from Saccharomyces cerevisiae) identified
in multiple Candida species. The upper panels show screen shots from CGOB. The genes highlighted in red were identified by SearchDOGS. The lower
panels show multiple alignments of the predicted protein sequences carried out using SeaView (Gouy et al. 2010). The species are listed in the same
order as in the top panels (S.c. = Saccharomyces cerevisiae).



correctly annotated in C. albicans and most of the other
species. However, Lsm5 was missed in 7 genomes, probably
because it is short (77-101 amino acids) and was not called in
C. albicans, which was used for annotating most of the
other genomes. Tom5, a component required for import of
proteins into the mitochondria, was also missed from both
C. albicans isolates and 8 other genomes. It was called correctly
only in C. parapsilosis (based on transcriptional data [Guida
et al. 2011]), C. orthopsilosis, and D. hansenii. Tom5 is very short
(47-50 amino acids), but very highly conserved (fig. 2).

Figure 2 shows a further example of a very small gene,
orthologous to CPAR2_602480 in C. parapsilosis, which was
added to 11 genomes including C. albicans. The protein is 38
amino acids long, highly conserved, and homologous to Pmp1
from Sac. cerevisiae, where it functions as a regulatory subunit
of the yeast plasma membrane H(+)-ATPase (Mousson et al.
2002). CPAR2_602480 was annotated in C. parapsilosis using
transcriptional data (Guida et al. 2011) and extended to other
species using synteny information from CGOB.

Many genes were not originally annotated because in- 
trons were not correctly assigned. We identified 381 novel 
genes containing introns (by bioinformatic analysis) and 
modified or added introns to 183 additional genes (table 1). 
A small number of unconserved open reading frames that 
overlapped with alternative conserved translations were 
removed. 
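
As a consistency check on the counts quoted above, the per-species values in the two "New Genes Added" columns of table 1 can be summed and compared with the reported totals. The sketch below simply transcribes those two columns; the variable names are illustrative.

```python
# (total new genes added, of which intron-containing), transcribed from table 1
NEW_GENES = {
    "C. albicans SC5314": (13, 0),
    "C. albicans WO-1": (114, 28),
    "C. dubliniensis": (88, 2),
    "C. tropicalis": (192, 96),
    "C. parapsilosis": (8, 2),
    "C. orthopsilosis": (7, 1),
    "L. elongisporus": (130, 57),
    "D. hansenii": (12, 3),
    "S. stipitis": (211, 11),
    "C. tenuis": (267, 9),
    "Spa. passalidarum": (93, 6),
    "M. guilliermondii": (213, 91),
    "Cla. lusitaniae": (177, 75),
}

total_added = sum(added for added, _ in NEW_GENES.values())
total_with_introns = sum(introns for _, introns in NEW_GENES.values())

if __name__ == "__main__":
    print(total_added, total_with_introns)
```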
Table 1. Numbers of Genes Added, Modified, and Removed in Each Species.

Species                    Updated     New Genes Added              Existing Genes Modified    Genes
                           Number of   Total     Intron-Containing  Total      Intron          Removed
                           Genes       Added     Added              Modified   Modified
Candida albicans SC5314    6,207       13        0                  2          0               7
C. albicans WO-1           6,268       114       28                 28         26              9
C. dubliniensis            6,070       88        2                  13         5               0
C. tropicalis              6,445       192       96                 32         29              8
C. parapsilosis            5,843       8         2                  3          0               0
C. orthopsilosis           5,707       7         1                  14         1               0
L. elongisporus            5,931       130       57                 17         17              2
Debaryomyces hansenii      6,411       12        3                  8          7               1
S. stipitis                6,026       211       11                 31         27^a            0
C. tenuis                  5,800       267       9                  7          5               7
Spathaspora passalidarum   6,071       93        6                  31         7               6
M. guilliermondii          6,135       213       91                 39         32              3
Clavispora lusitaniae      6,116       177       75                 39         27              4
Total                                  1,525     381                264        183             47

^a Number excludes ~1,200 in-frame "introns" that were removed from the S. stipitis annotation.

Before the SearchDOGS iterations, the best-annotated
genomes were those of D. hansenii (12 additional genes
predicted), and C. parapsilosis and C. orthopsilosis (8 and 7 genes
predicted, respectively). The D. hansenii genome was
sequenced and annotated by the Genolevures consortium
(Dujon et al. 2004; original gene identifiers beginning with
DEHA0). Significant improvements in the annotation were
later reported (gene identifiers beginning with DEHA2). The
high quality of the current D. hansenii annotation is a
reflection of the substantial manual curation and the expertise of
the consortium in annotating genomes of Saccharomycotina
species, through the application of a web-based collaborative
system (Magus; Martin, Sherman, et al. 2011). The current
C. parapsilosis genome annotation is based on transcriptional
data, which were used to correct several hundred gene models
and to identify 300 novel protein-coding genes with respect
to the original automated gene calling (Guida et al. 2011). The
C. orthopsilosis genome annotation is based on similarity and
synteny data from CGOB (Riccombeni et al. 2012) and is an
excellent illustration of the power of this approach.

The annotations of the xylose-degrading yeasts, S. stipitis,
Spa. passalidarum, and C. tenuis, all sequenced and annotated
by the Joint Genome Institute, presented the most problems.
We identified between 93 and 267 new protein-coding genes
in each species. In addition, while loading the genome data for
S. stipitis (Jeffries et al. 2007), we noticed that a large number
of gene models (1,611) included introns, with multiple introns
(totaling 2,567) in single genes in several cases. This number of
predicted introns is unusual, as species in the
Saccharomycotina have very few (e.g., 415 in 381 genes in C. albicans
[Mitrovich et al. 2007] and 422 in 387 genes in C. parapsilosis
[Guida et al. 2011]). None of the predicted introns in S. stipitis
have experimental support. On closer investigation, we
noticed that many of the predicted introns were in-frame,
and when included in the translation often generated a
predicted protein with greater similarity to orthologs in other
species than the translation of the spliced gene model (fig. 3).
We therefore carried out a systematic search for annotated
in-frame introns in S. stipitis that are not conserved in other
species. We removed 1,231 introns (not listed in table 1) that
we believe are incorrectly annotated. Because this resulted in
a major change in the genome annotation, we retained the
original S. stipitis gene identifiers rather than introducing new
ones for every modified gene. The genomes of C. tenuis and
Spa. passalidarum also contain high numbers of genes with
predicted introns (974 and 994, respectively), which we
suspect result from applying a gene-finding model developed for
filamentous fungi rather than for species in the
Saccharomycotina (Jeffries et al. 2007; Wohlbach et al. 2011). However,
we have not systematically investigated or removed introns
in these two species.
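
The first screening step for a suspect intron, testing whether it is in-frame and can be read through without meeting a stop codon, can be sketched as below. This covers only that single test and is an illustration under stated assumptions: the function name and the 0-based half-open coordinates are hypothetical, and the comparison with orthologs in other species that the text also relies on is not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def is_in_frame_intron(unspliced, intron_start, intron_end):
    """Return True if retaining the intron keeps the reading frame open.

    unspliced is the gene sequence from the start codon, introns included.
    intron_start and intron_end are 0-based, half-open positions within it.
    """
    if (intron_end - intron_start) % 3 != 0:
        return False  # retaining the intron would shift the downstream frame
    # Start at the codon that contains the first intron base.
    first_codon = intron_start - intron_start % 3
    for i in range(first_codon, intron_end, 3):
        if unspliced[i:i + 3].upper() in STOP_CODONS:
            return False  # read-through would truncate the protein
    return True
```

An intron that passes this test is only a candidate for removal; in the analysis described above, the decision also depended on the intron not being conserved in other species.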

Finding Poorly Conserved Orthologs and "Hidden 
Homology" 

In assigning genes to pillars, we first use BLASTP searches and
reciprocal best hits, with a conservative cutoff. However,
poorly conserved orthologs can be difficult to identify
based solely on protein similarity, and remain as "singleton"
pillars instead of being incorporated into other pillars. To
tackle this problem, we developed an algorithm called
Synteno-BLAST, which interprets weak BLAST scores in
combination with synteny information. Synteno-BLAST
systematically searches for putative orthologs by looking for singleton
pillars that can be merged with another pillar on the basis of a
BLASTP (E < 1e-5) hit to at least one gene in the other pillar,
provided that the assignment is also supported by the
syntenic context.
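
The merge rule just described can be sketched as a small function. This is a hedged illustration rather than the Synteno-BLAST code: the data structures are hypothetical, and "supported by the syntenic context" is approximated here by requiring that the singleton and the candidate pillar share at least one neighbouring pillar, which is an assumption.

```python
def propose_merges(singletons, pillars, evalues, e_cutoff=1e-5):
    """Suggest singleton-to-pillar merges backed by both BLASTP and synteny.

    singletons: gene name -> set of neighbouring pillar ids.
    pillars:    pillar id -> (set of member genes, set of neighbouring pillar ids).
    evalues:    (query gene, subject gene) -> BLASTP E-value.
    """
    merges = []
    for gene, gene_neighbours in singletons.items():
        for pillar_id, (members, pillar_neighbours) in pillars.items():
            has_hit = any(
                evalues.get((gene, member), float("inf")) < e_cutoff
                for member in members
            )
            in_synteny = bool(set(gene_neighbours) & set(pillar_neighbours))
            if has_hit and in_synteny:
                merges.append((gene, pillar_id))
    return merges


if __name__ == "__main__":
    singletons = {"dhan_X": {"P10", "P12"}}
    pillars = {
        "P11": ({"calb_Y", "ctro_Y"}, {"P10", "P12"}),
        "P50": ({"calb_Z"}, {"P49", "P51"}),
    }
    evalues = {("dhan_X", "calb_Y"): 1e-7, ("dhan_X", "calb_Z"): 1e-20}
    print(propose_merges(singletons, pillars, evalues))
```

In the usage example the much stronger hit to the distant pillar is rejected because it has no syntenic support, while the weaker hit to the neighbouring pillar is accepted.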

CGOB's Synteno-BLAST based approach reveals "hidden 
homology" between genes that cannot be found by BLASTP 
alone, and shows the importance of establishing orthology in 
the context of synteny. As much editing as possible is 
Fig. 3. Invalid intron annotation in Scheffersomyces stipitis. Alignment of a region of Sla1 from Candida albicans, Debaryomyces hansenii, and two
alternative gene models for S. stipitis. The red box highlights a region that was originally annotated as an in-frame intron (A), but where a revised model
that ignores the intron increases sequence similarity (B). We also removed four other predicted in-frame introns in PICST_82711.



automated. However, singleton genes with only very weak
BLASTP hits (up to E = 10) to a nearby pillar (and not
necessarily having hits to all the genes in that pillar), that were not
assigned to that pillar automatically, can be placed in it
manually with sufficient syntenic evidence. This manual curation
of orthology data provides one of the main strengths of the
CGOB database. For example, 604 genes from D. hansenii, 580
from C. tenuis, and 334 from Spa. passalidarum were added to
pillars following analysis of synteny. One example is shown in
figure 4. The gene EED1, which is required for filamentation in
C. albicans, was originally described as unique to this species
(Martin, Moran, et al. 2011). Comparing C. albicans EED1 with
the C. dubliniensis ORF (CD36_34980) shows the two proteins
have significant regions of similarity, with some apparent
deletions in EED1 (fig. 4B). One C-terminal region missing from
the C. albicans protein is predicted to encode a SANT domain,
a DNA binding domain shared by several chromatin
remodeling machines (Aasland et al. 1996). The SANT domain is
present in genes at the equivalent syntenic position in most
of the other CTG species (fig. 4A, C). The C. albicans and
C. dubliniensis proteins are significantly longer than the
predicted proteins from the other species. It is therefore likely
that EED1 was present in the common ancestor of the CTG
clade, but is rapidly evolving, and has undergone some
particularly significant changes in the C. albicans lineage (e.g., loss
of the SANT domain). As Eed1 is a repressor of the hyphal-to-yeast
transition in C. albicans, the divergence in gene
sequence may be associated with the ability of this species to
undergo true hyphal growth, a phenotype that is almost
unique in the CTG clade.



Identification of MTL Locus 

The term "Candida" traditionally denotes imperfect, or asexual, species.
Although a parasexual cycle has been identified in C. albicans,
C. dubliniensis, and C. tropicalis, resulting from mating of
haploid and diploid cells, these species have never been shown to
undergo meiosis (Pujol et al. 2004; Bennett and Johnson 2005;
Porman et al. 2011; Hickman et al. 2013). No sexual cycle has
been identified in some of the other diploid pathogens, such
as C. parapsilosis and C. orthopsilosis (Logue et al. 2005; Sai
et al. 2011). However, not all CTG clade species are asexual.
Some, such as D. hansenii and S. stipitis, have haplontic life
cycles: they undergo conjugation and almost immediately
go through meiosis (van der Walt et al. 1977; Melake et al.
1996). Mating and meiosis of haploid isolates of Cla. lusitaniae
and M. guilliermondii have also been observed (Wickerham and
Burton 1954; Reedy et al. 2009).

Mating type is determined by the genes at the MTL, or
mating type-like locus. Many Candida species (including
diploid asexual species) have heterothallic MTL idiomorphs, and
mating occurs between cells of opposite mating type, MTLa
and MTLα (Butler 2007, 2010). Cell type is determined by the
mating genes, a1 and a2 at the MTLa locus, and α1 and α2 at
the MTLα locus. Alleles of other genes (PIK, PAP, and OBP)
within the idiomorphs have no apparent roles in mating, but
may be involved in biofilm development (Srikantha et al.
2012). Some CTG clade species (such as D. hansenii and
S. stipitis) are homothallic, with mating occurring between
genetically identical cells. There are genes from both MTLa
and MTLα (a1, a2, and α1) at a single locus in these species
(Fabre et al. 2005; Butler 2010).
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Fig. 4. EED1 orthologs in Candida species. (A) Screen shot from CGOB, showing synteny around the ££D7 gene from C albicans strain SC5314. The 
species are in the same order as shown in figure 1. In C albicans strain WO-1, there is a short gap in the genome sequence internal to EED1 with two 
frameshifting indels, making it a possible pseudogene (indicated with a "p"). (B) Alignment of putative EED1 orthologs from C albicans SC5314 and 
C dubliniensis, generated using T-coffee (Notredame et al. 2000). The boxed region highlights the SANT domain which is missing from the C albicans 
protein. (C) Alignment of SANT domains from Eed1 proteins excluding C albicans. 



Examination of the MTL loci of C. tenuis and Spa. passalidarum
suggests that the former is heterothallic (with MTLa in
the sequenced isolate) and the latter is homothallic (fig. 5A).
The designation was not immediately obvious because the
genes MTLa1 and MTLa2 in C. tenuis and MTLa1 in Spa.
passalidarum were not present in the original genome
annotations. We also identified MTLa1 in S. stipitis. The structure
of the MTLa locus in C. tenuis is similar to the equivalent

Fig. 5. Identification of MTL loci. (A) The putative MTL loci from
Candida tenuis, Scheffersomyces stipitis, and Spa. passalidarum are
shown in comparison with the MTLa and MTLα idiomorphs from
C. albicans. Introns are indicated with narrow rectangles. The structure
of the C. tenuis idiomorph closely resembles MTLa from C. albicans.
Both S. stipitis and Spa. passalidarum have homothallic-like structures,
with mating genes from both MTLa and MTLα. (B) Phylogenetic
relationship of species in the CGOB database, rooted using Saccharomyces
cerevisiae. The tree was constructed using PhyML. A bootstrap value of
82% is shown for one branch; all other branches had 100% support from
100 replicates. Various supertree methods gave either this topology, or
an alternative one in which the positions of C. tenuis and Clavispora
lusitaniae were swapped. Species that have lost MTLa2 are marked.



idiomorph in C. albicans (fig. 5A). As previously reported, the
S. stipitis idiomorph contains a1, α1, and α2 genes at the same
location, similar to D. hansenii (Butler 2010). Spathaspora
passalidarum also has a homothallic structure, but has only
one MTLα gene (α1) and both MTLa genes (a1 and a2). Like
S. stipitis, the homothallic Spa. passalidarum MTL appears to
have arisen from acquisition of information from an MTLa
idiomorph integrated at an MTLα locus (Butler 2010).

Earlier genomic analysis suggested that a2 was lost from
the sexual species (D. hansenii, S. stipitis, M. guilliermondii, and
Cla. lusitaniae) in the CTG clade, and this was correlated with
differences in sporulation between Candida and
Saccharomyces species (Butler et al. 2009; Butler 2010). However,
both C. tenuis and Spa. passalidarum have retained a2.
Examination of the evolutionary relationship of the species
(fig. 5B) shows that it is no longer clear that the loss of a2
occurred on a single ancestral branch; it may indeed have
occurred independently in several lineages.



Conclusion 

We report here the development of new nucleotide-based
data structures for CGOB and YGOB, and illustrate the
application of syntenic information for gene discovery. We have
greatly improved the annotation of most of the Candida
species, though some errors still remain. In the entire CGOB
database (78,505 protein-coding genes excluding Sac.
cerevisiae), 824 genes do not begin with an ATG, and 1,281 do not
have an annotated stop codon. These are indicated by
warning messages in the CGOB pillars. The majority (>60%) of the
problematic genes are from the genomes of S. stipitis, C. tenuis,
and Spa. passalidarum. However, the new annotations
provide an important tool for the research community; for
example, browser pillar information has been used recently to
help characterize Gene Ontology annotation in Candida
species (Inglis et al. 2013) and to study xylose pathway evolution
(Riccombeni et al. 2012).
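
The two checks behind those warning messages can be illustrated with a minimal Python sketch. It assumes that each coding sequence is available as a nucleotide string that includes its final codon, and that TAA, TAG, and TGA are the stop codons. The function name annotation_warnings and the message wording are hypothetical and are not taken from the CGOB code.

```python
# Hypothetical sketch: flag coding sequences that lack an ATG start
# or do not end in a stop codon. Not the CGOB implementation.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def annotation_warnings(cds):
    """Return a list of warning strings for one coding sequence."""
    cds = cds.upper()
    warnings = []
    if not cds.startswith("ATG"):
        warnings.append("does not begin with ATG")
    if cds[-3:] not in STOP_CODONS:
        warnings.append("no annotated stop codon")
    return warnings


clean = annotation_warnings("ATGGCTTAA")
no_start = annotation_warnings("CTGGCTTAA")
no_stop = annotation_warnings("ATGGCTGCT")
```

A sequence can trigger both warnings, so the function returns a list rather than a single status.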

Materials and Methods 

Construction of Nucleotide-Based Databases 
Both YGOB (Byrne and Wolfe 2005, 2006) and CGOB were
converted to nucleotide-based frameworks; only CGOB is
described here, but similar changes have been implemented
in both browsers. This was accomplished by replacing the
static gene order lists and protein sequence files for each
species with information calculated dynamically from
genome sequences. For each species, we store a local version
of the genome annotation and sequence, derived initially
from NCBI, EMBL, SGD (Cherry et al. 2012), or CGD
(Costanzo et al. 2006) annotations. Each genome annotation
file stores the following information for each gene or other
genomic feature:

- Name: the unique name used to identify the feature.

- Orientation: 0 or 1 for Crick or Watson strand, respectively.

- Start coordinate: the lowest-numbered coordinate in the
range of the feature.

- Stop coordinate: the highest-numbered coordinate in the
range of the feature.

- On/Off: determines whether the feature is displayed in
CGOB/YGOB.

- Chromosome/Contig/Scaffold number: identifying
number of source sequence.

- Short Name: the shorter name that will appear in the
feature's on-screen box.

- Coordinates: complete coordinates of the feature with
intron/exon annotation and complement tag if
appropriate.

- Notes: tags imported from GenBank, CGD/SGD
descriptions, or added by CGOB/YGOB curators.
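
As an illustration only, one record with the fields listed above could be represented as follows. The sketch assumes a tab-separated line with the fields in the listed order and the On/Off flag stored as the text ON or OFF. The actual file layout is not specified here, and the class name, parser, and example gene name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str          # unique name identifying the feature
    orientation: int   # 0 = Crick strand, 1 = Watson strand
    start: int         # lowest-numbered coordinate of the feature
    stop: int          # highest-numbered coordinate of the feature
    on: bool           # whether the feature is displayed in the browser
    chromosome: int    # chromosome/contig/scaffold number
    short_name: str    # label shown in the on-screen box
    coordinates: str   # full coordinates, including introns/exons
    notes: str         # imported tags or curator comments


def parse_feature(line):
    """Parse one tab-separated record (assumed layout, see lead-in)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 fields, got {len(fields)}")
    name, orient, start, stop, on, chrom, short, coords, notes = fields
    feature = Feature(name, int(orient), int(start), int(stop),
                      on == "ON", int(chrom), short, coords, notes)
    if feature.orientation not in (0, 1):
        raise ValueError("orientation must be 0 or 1")
    if feature.start > feature.stop:
        raise ValueError("start must not exceed stop")
    return feature


example = parse_feature(
    "GENE_0001\t1\t1000\t3400\tON\t3\tG1\t1000..3400\texample note"
)
```

Keeping Start and Stop as the lowest and highest coordinates regardless of strand means that features can be ordered along a chromosome without consulting the orientation.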

Nonprotein-coding features such as tRNA and rRNA genes
and centromeres are annotated in an identical but parallel
way to protein-coding features, with each type of feature
having its own annotation file (with the above format).
This parallel approach facilitates turning nonprotein-coding
features on/off and the different on-screen and backend
treatment of them, but most importantly stops the mixing
of different feature types in the same pillar. It also allows for
the easy later addition of new feature types to a genome that
was initially loaded without them.

The genome annotation files for each species are
associated with a particular FASTA DNA sequence file containing
the corresponding genome sequence. The On/Off function
allows us to choose to ignore certain features, such as
"dubious" genes that were present in the annotation we imported
but which we do not wish to display in CGOB/YGOB, without
losing track of them (e.g., they are listed if the intergenic
region they are in is examined).
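
The behaviour of the On/Off flag can be sketched in Python as below. This is a simplified illustration on a single chromosome with invented feature names. The helper functions are hypothetical, and the rule that a hidden feature must lie entirely within the intergenic region is an assumption.

```python
from typing import NamedTuple


class Feature(NamedTuple):
    name: str
    start: int   # lowest-numbered coordinate
    stop: int    # highest-numbered coordinate
    on: bool     # True if the feature is displayed


def displayed(features):
    """Features that are switched on, ordered by start coordinate."""
    return sorted((f for f in features if f.on), key=lambda f: f.start)


def hidden_in_region(features, region_start, region_stop):
    """Switched-off features lying entirely inside the given region."""
    return [f for f in features
            if not f.on and f.start >= region_start and f.stop <= region_stop]


features = [
    Feature("geneA", 100, 500, True),
    Feature("dubious1", 600, 750, False),
    Feature("geneB", 900, 1500, True),
]

shown = displayed(features)
# Intergenic region between the two displayed genes.
gap_start, gap_stop = shown[0].stop + 1, shown[1].start - 1
hidden = hidden_in_region(features, gap_start, gap_stop)
```

Here dubious1 is absent from the displayed gene order but is still reported when the region between geneA and geneB is examined.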

The annotation files and sequence file for each genome are
then used as the source from which all other sequence
information in the browsers is generated, including the amino acid
and nucleotide sequences of individual genes, the DNA
sequences of intergenic regions, and the internal BLAST
databases. The order in which the features are displayed on-screen
is determined by the order of their Start coordinates.
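
A minimal Python sketch of this derivation is given below. It assumes 1-based inclusive coordinates, that Watson-strand features (orientation 1) read as stored, and that Crick-strand features (orientation 0) are the reverse complement. Introns and overlapping features are ignored, and the function names and toy sequence are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def feature_sequence(chrom_seq, start, stop, orientation):
    """Nucleotide sequence of a feature, ignoring introns."""
    sub = chrom_seq[start - 1:stop]
    return sub if orientation == 1 else reverse_complement(sub)


def intergenic_regions(chrom_seq, features):
    """Order features by start and return the gaps between neighbours.

    features: list of (name, orientation, start, stop) on one chromosome.
    """
    ordered = sorted(features, key=lambda f: f[2])
    regions = []
    for left, right in zip(ordered, ordered[1:]):
        gap_start, gap_stop = left[3] + 1, right[2] - 1
        if gap_start <= gap_stop:
            regions.append((left[0], right[0], chrom_seq[gap_start - 1:gap_stop]))
    return ordered, regions


chrom = "AAATGCCCTAAGGGGTTACATCC"
features = [("geneC", 0, 16, 21), ("geneW", 1, 3, 11)]

ordered, regions = intergenic_regions(chrom, features)
watson_seq = feature_sequence(chrom, 3, 11, 1)
crick_seq = feature_sequence(chrom, 16, 21, 0)
```

Because everything is recomputed from the stored coordinates and the genome sequence, changing a coordinate in the annotation changes the derived gene and intergenic sequences without any separate sequence file needing to be edited.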

An editor feature allows CGOB/YGOB curators to modify 
the coordinates of features, or to create new features such as 
previously unannotated genes. Thus, we have the ability to 
modify the annotations we imported from other databases, 
but not to edit the genome sequence itself.

The structure of the database of homologous gene
assignments across species (pillars) is unchanged from the original
versions of CGOB and YGOB. We migrated pillar assignments
from the protein-based versions to the nucleotide-based
versions of the databases to the greatest extent possible.

Novel genes were identified by two iterations of
SearchDOGS (ÓhÉigeartaigh et al. 2011) and by manual
investigation.

Phylogenetic Analysis 

The species tree was constructed using PhyML (BLOSUM +
I + Γ with 8 rate classes [Guindon and Gascuel 2003]) using as
input 100,000 informative amino acid sites from proteins that
are present in all 14 species, randomly chosen from MUSCLE
alignments filtered by Gblocks (Castresana 2000; Edgar 2004).
Other alignments were generated using T-coffee (Notredame
et al. 2000) and ClustalW implemented through SeaView
(Gouy et al. 2010).
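
The random selection of alignment sites can be sketched as follows. This is not the procedure used for the published tree: it assumes a concatenated, already filtered alignment held as one string per species, samples columns without replacement, and does not test whether a column is informative. The function name and toy alignment are hypothetical.

```python
import random


def sample_sites(alignment, n_sites, seed=0):
    """Pick the same random columns from every sequence in an alignment.

    alignment: dict mapping species name to an aligned sequence;
    all sequences must have the same length.
    """
    lengths = {len(seq) for seq in alignment.values()}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")
    length = lengths.pop()
    if n_sites > length:
        raise ValueError("cannot sample more sites than the alignment has")
    rng = random.Random(seed)
    columns = sorted(rng.sample(range(length), n_sites))
    return {species: "".join(seq[i] for i in columns)
            for species, seq in alignment.items()}


# Toy alignment: lower-case letters mark the same columns in species 2.
alignment = {"sp1": "ABCDEFGHIJ", "sp2": "abcdefghij"}
sampled = sample_sites(alignment, 4, seed=1)
```

Sampling column indices once and applying them to every species keeps the homology between rows intact, which is the property a phylogenetic method relies on.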

Supplementary Material

Supplementary table S1 is available at Molecular Biology and
Evolution online (http://www.mbe.oxfordjournals.org/).